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Abstract 

We develop stochastic variational inference, a scalable algorithm for approximating 
posterior distributions. We develop this technique for a large class of probabilistic 
models and we demonstrate it with two probabilistic topic models, latent Dirichlet 
allocation and the hierarchical Dirichlet process topic model. Using stochastic 
variational inference, we analyze several large collections of documents: 300K articles 
from Nature, 1.8M articles from The New York Times, and 3.8M articles from 
Wikipedia. Stochastic inference can handle the full data, and outperforms traditional 
variational inference on a subset. (Further, we show that the Bayesian nonparametric 
topic model outperforms its parametric counterpart.) Stochastic variational inference 
lets us apply complex Bayesian models to very large data setsQ 

1. This paper is under submission. We welcome comments and suggestions. 
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1. Introduction 

Here are several examples of modern data analysis problems. (1) We have an archive 
of two million books, scanned and stored online. We want to organize these books by 
subject and build a navigator for users to explore our collection. (2) We have data 
from an online shopping website containing millions of users' purchase histories as 
well as descriptions of each item in the catalog. We want to recommend items to users 
based on this information. (3) We are continuously collecting data from an online 
feed of photographs. We want to build an classifier from these data. (4) We have 
measured the gene sequences and many traits of a large population. We want to make 
hypotheses about connections between genes and traits. 

These problems illustrate some of the modern challenges that we face when developing 
methods for analyzing data. Our data are complex and high-dimensional; we have 
assumptions to make — from science, intuition, or other data analyses — and these 
assumptions often take the form of the hidden structure, structure that we believe 
exists in the data but that we cannot directly observe; and finally our data sets are 
large, possibly even arriving in a never-ending stream. 

Statistical machine learning research has addressed some of these challenges by 
developing the field of probabilistic modeling, a field that provides an elegant approach 



Pearl 


1988; 


Jordan , 


1999, 


Bishop 



2006; Koller and Friedman, 2009). In particular, probabilistic graphical models give 
us a visual language for expressing assumptions about data, and the corresponding 
posterior inference algorithms let us analyze data under those assumptions. With the 
results of this field, we now enjoy a powerful suite of probability models to connect 
and combine. And we have general-purpose computational strategies for fitting our 
models and for estimating the quantities needed to use them. 

The problem we face is scale. Inference algorithms of the 1990s and 2000s used 
to be considered scalable, but they cannot easily handle the amount of data that 
we described in the four examples above. This is the problem we address in this 
paper. We present an approach to approximate posterior inference in a large class of 
probability models that is appropriate for massive data sets, data that might not fit in 
memory or even be stored locally. Our method does not require clusters of computers 
or specialized hardware, though it can be further sped up with these amenities. 

Our algorithm builds on variational inference, a method that turns complex inference 



problems into high-dimensional optimization problems (Jordan et al. , 1999). Tradition- 
ally, this optimization has been solved with a coordinate ascent algorithm, iterating 
between reanalyzing every data point in the data set and re-estimating parameters 
that summarize its hidden structure. However, this is inefficient for the large data sets 
we hope to consider. In this paper we will derive a more efficient algorithm by using 
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stochastic optimization (Robbins and Monro, 1951), an optimization technique that 



follows noisy estimates of the gradient of the objective. In the context of variational 
inference, this gives an efficient algorithm that iterates between subsampling the data 
and adjusting the parameters based only on the subsample. We call our method 

stochastic variational inference. 

We will derive stochastic variational inference for a large class of graphical models. 
We will study its performance on two kinds of probabilistic topic models, which are 
probabilistic models of text that can uncover the hidden thematic structure in large 
collections of documents. (For example, topic models are used in problem above.) 
In particular, we demonstrate stochastic variational inference on latent Dirichlet 
allocation (Blei et al. 2003), a simple topic model, and the hierarchical Dirichlet 



process topic model (Teh et al. , 2006a), a more flexible model where the number of 



discovered topics grows with the data. (This latter application demonstrates how to 
use stochastic variational inference in a variety of Bayesian nonparametric settings.) 
Stochastic variational inference can efficiently analyze massive data sets with complex 
probabilistic models. 



Technical summary. We now turn to the technical context of our method. In 
probabilistic modeling, we use latent variables z to encode hidden structure in observed 
data x; we articulate the relationship between the hidden and observed variables with 
a factorized probability distribution (i.e., a graphical model); and we use inference 
algorithms to estimate the posterior distribution of the hidden structure given the 
observations p(z \x). In descriptive tasks, like problems #1 and #4, the posterior 
provides a new way to explore the data — the organization of books or the connections 
between genes and traits — with the hidden structure probabilistically "filled in." In 
predictive tasks, like problems ^2 and #3, we use the posterior to form the predictive 
distribution of a new observation. In particular, the predictive distribution marginalizes 
out the hidden structure via the posterior and can then be used, for example, to make 
a recommendation to a user or to classify a new image. 

Consider a graphical model of hidden and observed random variables for which we 
want to compute the posterior distribution p(z | x). For many models of interest, this 
posterior is not tractable to compute and we must appeal to approximate methods. 
The two most prominent strategies in statistics and machine learning are Markov 
chain Monte Carlo (MCMC) sampling and variational inference. In MCMC sampling, 
we construct a Markov chain over the hidden variables whose stationary distribution 



is the posterior of interest (Robert and Casella, 2004). We run the chain until it 



has (hopefully) reached equilibrium and collect samples to approximate the posterior. 
In variational inference, we define a flexible family of distributions over the hidden 
variables, indexed by free parameters (Jordan et al. 1999). We then find the setting of 
the parameters (i.e., the member of the family) that is closest to the posterior. Thus 
the inference problem turns into an optimization problem. 



3 



Hoffman, Blei, Wang, and Paisley 



Neither MCMC nor variational inference scales to the kinds of settings described in 
the first paragraph. Researchers have proposed speed-ups of both approaches, but 
these usually are tailored to specific models or compromise the correctness of the 
algorithm (or both). Here, we develop a general variational method that scales. 



As we mentioned above, the main idea is to use stochastic optimization (Robbins 



and Monro 


1951; 


Spall 


2003 



an objective function by following noisy estimates of its gradient. If the expectation 
of the noisy gradient equals the true gradient then stochastic optimization provably 
converges to an optimum of the objective. Stochastic optimization is particularly 
attractive when the objective (and therefore its gradient) is a sum of many terms that 
can be computed independently. In that setting, we can repeatedly subsample the 
terms to give noisy gradients that are cheaper to compute than the true gradient. 

Variational inference is amenable to stochastic optimization because the variational 
objective decomposes into a sum of terms, one for each data point in the analysis. 
We can cheaply obtain noisy estimates of the gradient by subsampling the data and 
computing a scaled gradient on the subsample. If we sample independently then the 
expectation of this noisy gradient is equal to the true gradient. With one more detail — 



the idea of a natural gradient (Amari, 1998), a type of gradient that is particularly 
simple when applied to the variational objective — stochastic variational inference has 
an attractive form: 



1. Subsample one or more data points from the dataset. 

2. Analyze the subsample using the current variational parameters. 

3. Implement a closed form update of the variational parameters. 

4. Repeat. 



While traditional algorithms require repeatedly analyzing the whole data set before 
updating the variational parameters, this algorithm only requires that we analyze 
randomly sampled subsets. We will show how to use this algorithm for a large class of 
graphical models. 



Related work. Scalable posterior inference has received much attention from the 
machine learning community. Most research has focused on parallelizing existing 
algorithms to allow the application of large-scale computation to large-scale data 
analysis. The topic modeling community has been particularly interested in these 



approaches, producing parallel variational inference (Zhai et al. 2012) and parallel 



MCMC (Newman et al.. 2009, Smola and Narayanamurthy , 2010, Ahmed et al. 2012) 



An alternative approach to large-scale learning uses stochastic optimization. These 
techniques have a long history in the machine learning literature, going back to the 
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early days of neural networks (Rosenblatt, 1958, Widrow and Hoff, 1960). Interest in 



stochastic optimization algorithms for machine learning has blossomed again in recent 
years as the theoretical efficiency of stochastic gradient algorithms has become better 



understood (LeCun et al. 1998 Bottou and LeCun 2004 Bottou and Bousquet, 2011) 



As examples, these algorithms have been successfully adapted to fit support vector 



machines (Shalev-Shwartz et al. 2007) and perform dictionary learning (Mairal et al. 



2010). This work has shown both empirically and theoretically that using stochastic 



learning can dramatically accelerate learning speed in the presence of large amounts 
of data, without relying on the availability of vast computational resources. 

The work discussed above uses stochastic optimization in non-probabilistic contexts. 



Sato and Ishii (2000) developed a stochastic variant on the expectation-maximization 



(EM) algorithm of Dempster et al. (1977) that finds maximum-marginal- likelihood 
estimates of parameters to probabilistic models, integrating over a set of hidden 



variables. Neal and Hinton (1999) proposed a similar algorithm, though not couched 



in the theory of stochastic optimization. Finally, Sato (2001) developed an online 



variational Bayes algorithm using stochastic optimization, although it can only be 
applied to the fairly limited set of models that can be fit using the classic EM algorithm. 
In contrast to Sato's algorithm, stochastic variational inference generalizes to a much 
wider set of probabilistic models, in particular those that are amenable to closed-form 



coordinate ascent inference (Bishop et al., 2003, Xing et al., 2003, Beal, 2003) 



The organization of this paper. In Section [2j we review variational inference for 
graphical models and then derive stochastic variational inference. In Section [3| we 
review probabilistic topic models and Bayesian nonparametric models and then derive 
the stochastic variational inference algorithms in these settings. In Section |4| we study 
stochastic variational inference on several large text data sets. 



2. Stochastic Variational Inference 

We derive stochastic variational inference, a stochastic optimization algorithm for 
mean-field variational inference. Our algorithm approximates the posterior distribution 
of a probabilistic model with hidden variables, and can handle large (or even streaming) 
data sets of observations. 

We divide this section into four parts. 

1. We define the class of models to which our algorithm applies. We define local 
and global hidden variables, and requirements on the conditional distributions 
within the model. 
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Figure 1: A graphical model with observations x\._jf, local hidden variables zi-.n and 
global hidden variables (3. The distribution of each observation x n only 
depends on its corresponding local variable z n and the global variables (3. 
(Though not pictured, each hidden variable z n , observation x n , and global 
variable (3 may be a vector consisting of multiple random variables.) 



2. We review mean-field variational inference, an approximate inference strategy 
that seeks to find a tractable distribution over the hidden variables that is close 
to the posterior distribution. We derive the traditional variational inference 
algorithm in our setting, which is a coordinate ascent algorithm. 

3 . We review the natural gradient, and derive the natural gradient of the variational 
objective function. The natural gradient closely relates to the coordinate ascent 
variational inference algorithm. 

4. We review stochastic optimization, a technique that uses noisy estimates of a 
gradient to optimize an objective function, and apply it to variational inference. 
Specifically, we use stochastic optimization with noisy estimates of the natural 
gradient of the variational objective. These estimates arise from repeatedly 
subsampling the data set. We show how stochastic variational inference easily 
builds on traditional variational inference algorithms, but can handle much 
larger data sets. 



2.1 Models with local and global hidden variables 

We study models involving observations, global hidden variables, local hidden variables, 
and fixed parameters. The iV observations are x = Xi-n', the vector of global hidden 
variables is (3; the iV local hidden variables are z = Zi : jv, each of which is a collection 
of J variables z n = z n ^-j; the vector of fixed parameters is 

2. The fixed parameters a can contain parameters that partly govern any of the random variables 
(e.g., fixed parts of the conditional distribution of observations). However, to keep notation simple, 
we consider them only to govern the global hidden variables. 
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The distinction between local and global hidden variables is determined by the 
conditional dependencies that define the model. In particular, the nth observation x n 
and the nth local hidden variable z n are conditionally independent (given 0) of all 
other observations and local hidden variables x_ n and z_ n . (The notation x_ n refers 
to the set of variables except the nth.) That is, the joint distribution factorizes into a 
global term and a product of local terms, 

N 

p((3, z 1:N , x 1:N ) = p(p) Y[ p{z n , x n \(3). (1) 

71=1 

Figure [T] illustrates the graphical model. Our goal is to approximate the posterior 
distribution of the hidden variables given the observations, p(z, /3\x). 

This kind of model frequently arises in Bayesian statistics. In that setting, the global 
variables /3 are parameters endowed with a prior p(j3) and each local variable z n 
contains the hidden structure that governs the nth observation. For example, consider 
a Bayesian mixture of Gaussians. The global variables are the mixture proportions 
and the means and variances of the mixture components; the local variable z n is the 
hidden cluster label for the nth observation x n . 

We have described the independence assumptions of the hidden variables. We make fur- 
ther assumptions about the complete conditionals in the model. A complete conditional 
is the conditional distribution of a hidden variable given the other hidden variables and 
the observations. We assume that these distributions are in the exponential family, 

p(/3 \x,z) = h{p) exp{n g (x, z, a) T t(/3) - a g (n g (x, z, a))} (2) 

p(z nJ | x n , z n _j, f$) = h(z nJ ) exp{r} £jj (x n , z n _j, f3) T t(z n>j ) - a i:j (n ej (x n , z n _ h (3))}. 

(3) 

The scalar functions h(-) and a(-) are respectively the base measure and log-normalizer; 
the vector functions //(■) and t(-) are respectively the natural parameter and sufficient 
statistics^ These are conditional distributions, so the natural parameter is a function 
on the variables that are being conditioned on. (The subscripts on the natural 
parameter rj indicate complete conditionals for local or global variables.) Notice that 
for the local variables z n j, the complete conditional distribution is determined by 
the global variables (3 and the other local variables in the nth context, i.e., the nth 
data point x n and the local variables z n -j. This follows from the factorization in 
Equation [T] 

These assumptions on the complete conditionals imply a conjugacy relationship 
between the global variables and the local contexts (z n ,x n ); and this relationship 

3. We use overloaded notation for the functions h(-) and t(-) so that they depend on the names of 
their arguments; for example, h{z n f) can be thought of as a shorthand for the more formal (but 
more cluttered) notation h Zn Az n j). 
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implies a specific form of the complete conditional for /3. Specifically, the distribution 
of the local context given the global variables is in an exponential family, 



p(x n , z n \f3) = h(x n , z n ) exp{g((3) T t(x n , z n ) - a e (g((3))}. 



(4) 



Note that the global variables are passed through a function g((3) to form the natural 
parameter. The prior distribution p(/3) is also in an exponential family, 



p((3) = h(/3) exp{a T t(/3) - a g {a)}, 



(5) 



where the sufficient statistics are t(/3) = (g({3), — a^(g(/3))) and thus the hyperparameter 
a has two components a = (a^o^)- The first component is a vector of the same 
dimension as g(/3); the second is a scalar. 

Equations [4] and [5] imply that the complete conditional for the global variable in 
Equation [2] is in the same exponential family as the prior with parameter 



■q g {x,z,a) 



{atl + Y,n=l t ( Z n,Xn),Ol2 + N). (6) 

This form will be important when we derive stochastic variational inference in Sec- 



tion 



2.4[ For a general discussion of conjugacy and the exponential family, see|Bernardo 



and Smith (1994). 



This general family of distributions — those with local and global variables, and 
where the complete conditionals are in the exponential family — contains many useful 
statistical models from the machine learning and statistics literature. Examples include 



Bayesian mixture models (Attias, 2000), latent Dirichlet allocation (Blei et al. 2003), 



1960; Fox et al. 2011a), factorial models (Ghahramani and Jordan, 1997), hierarchical 



linear regression models (Gelman and Hill, 2007), hierarchical probit classification 



models (McCullagh and Nelder 



1989 



hidden Markov models (and many variants) ( |Rabiner 1989; Fine et al. 1998; Fox 



et al. , 2011b Paisley and Car in} 2009a), Kalman filters (and many variants) (Kalman 



Girolami and Rogers 2006 ) , probabilistic matrix 



factorization models ( 


Tipping 


and Bishop 


1999 


Collins et al. 


2002 


Wang, 


2006 


Salakhutdinov and MnihH2008; 


Paisley and Carin, 


2009b, 


Hoffman et al. 


2010b 


), and 



1995 Teh et al. 



certain Bayesian nonparametric mixture models (Antoniak 
2006a[ )f1 



1974, Escobar and West 



Analyzing data with one of these models amounts to computing the posterior distri- 
bution of the hidden variables given the observations: 



p(z,P\x) 



(7) 



4. We note that our assumptions can be relaxed to the case where the full conditional p((3\x, z) is 
not tractable, but each partial conditional p(0k\x, z, /3_k) associated with the global variable j3k 
is in a tractable exponential family. The topic models of the next section do not require this 
complexity, so we chose to keep the derivation a little simpler. 
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We then use this posterior to explore the hidden structure of our data or to make 
predictions about future data. Unfortunately, for many models (such as the examples 
listed above) we cannot directly compute the posterior because the denominator is an 
intractable integral. Thus we resort to approximate posterior inference, a problem that 
has been a focus of modern Bayesian statistics. We now turn to mean-field variational 
inference, which roots our strategy for scalable approximate inference. 



2.2 Mean- field variational inference 



Variational inference turns the inference problem into an optimization problem. We 
introduce a family of distributions over the hidden variables that is indexed by a 
set of free parameters, and then optimize those parameters to find the member of 
the family that is closest to the posterior of interest. (Closeness is measured with 
Kullback-Leibler divergence.) We use the resulting distribution, called the variational 
distribution, to approximate the posterior. 

In this section we review mean-field variational inference, which uses a variational 
family where each hidden variable is independent. We describe the variational objective 
function, discuss the mean-field variational family, and derive a coordinate ascent 
algorithm for fitting the variational parameters. This coordinate-ascent algorithm will 
be a stepping stone to stochastic variational inference. 



The evidence lower bound. Variational inference minimizes the Kullback-Leibler 
(KL) divergence from the variational distribution to the posterior distribution by 
maximizing the evidence lower bound (ELBO), a lower bound on the logarithm of 
the marginal probability of the observations \ogp(x). The ELBO is equal to the KL 
divergence up to an additive constant. 

We derive the ELBO by introducing a distribution over the hidden variables q(z, f3) 
and using Jensen's inequality^] 



logp(a;) 



log 
log 



p{x,z,/3) 



log E 



p(x, z, (3) 

>E q [\ogp(x,z,f3)}-E q [\ogq(z,f3)} 
= C{q). 



(8) 
(9) 

(10) 

(11) 
(12) 



5. Recall that Jensen's inequality implies that the logarithm of a function's expected value is greater 
than or equal to the expected value of the logarithm. 
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The ELBO contains two terms. The first term is the expected log joint, E, q [logp(x, z, /?)]. 
The second term is the entropy of the variational distribution, — E g [logg(z, Both 
these terms depend on q(z, /3), a distribution of the hidden variables. 

We restrict q(z, (3) to be in a family that is tractable, i.e., one for which the expectations 
in the ELBO can be efficiently computed, and we try to find the member of the family 
that maximizes the ELBO. Solving this maximization problem is equivalent to finding 



the member of the family that is closest in KL divergence to the posterior (Jordan 
et a!4|1999t |Wainwright and Jordan[ |2008fl , 



KL{q(z,f3)\\p(z,P\x)) = E q [logq{z,(3)]-E q \p(z,f3\x)} 

= E q [log q(z, (3)} - E q \p(x, z, (3)] + logp(x) 
= —C{q) + const. 

Note we absorbed logp(a;) into a constant because it does not depend on q. We use 
the optimized distribution q*(z, (3) as a proxy for the posterior. 

The mean-field variational family. The simplest variational family of distribu- 
tions is the mean-field family. In this family, each hidden variable is independent and 
governed by its own parameter, 

N J 

q(z, (3) = q(fi | A) J] H q(z nj | <j> n>j ). (13) 

n=l j=l 

The global parameters A govern the global variables; the local parameters 4> n govern 
the local variables in the nth context. The ELBO is a function of these parameters. 



Equation [13] gives the factorization of the variational family, but does not specify 
its form. We set q((3\\) and q{z n j\(f) n j) to be in the same exponential family as the 
complete conditional distributions p(/3\x,z) and p(z n> j\x n , z n -j, /3) (from Equations^ 
and [3]) and since we focus on conjugate exponential family models these conditionals 
are in the same family as the prior. The variational parameters A and n j are the 
natural parameters to those families, 

q{(3 I A) = h{(3) exp{A T t(/3) - a g (X)} (14) 

q(Zn,j I <frn,j) = K Z n,j) e*P{4>l,jt( z n,j) ~ a ej ((j) n j)}. (15) 

Again, we overload the base measures h(-) and sufficient statistics t(-). Assuming that 
these exponential families are the same as their corresponding conditionals means 



that t(-) and h(-) in Equation 14 are the same functions as t(-) and h(-) in Equation \l 



Likewise, £(•) and h(-) in Equation 15 are the same as in Equation[3j We will sometimes 
suppress the explicit dependence on <fi and A, substituting q(z n j) for q(z n j\<f) n j) and 
q{P) for q{p\\). 



10 



Stochastic Variational Inference 



The mean- field family has several computational advantages. At the outset, the 
entropy term decomposes, 

N J 

-Ejlogg(z,/3)] = -E A [logg(/3)]-^^E^.[logg(z nj )], (16) 

71=1 j=l 

where .[•] denotes an expectation with respect to q(z n> j \ (j) n ,j) an d Ea[-] denotes an 
expectation with respect to q(f3 | A). 

The gradient of the ELBO and coordinate ascent inference. We have defined 



the objective function in Equation 11 and the variational family in Equation 13 Our 



goal is to optimize the objective with respect to the variational parameters. 



In traditional mean-field variational inference, we optimize Equation 11 with coordi- 
nate ascent. We iteratively optimize each variational parameter, holding the other 
parameters fixed. With the assumptions that we have made about the model and 
variational distribution — that each conditional is in an exponential family and that 
the corresponding variational distribution is in the same exponential family — we can 
optimize each coordinate in closed form. 

We first derive the coordinate update for the parameter to the variational distribution 
of the global variables q((3 | A). As a function of A, we can rewrite the objective as 

£(A) = E q [\og p{f3\z,x)] -Ejlogg(/3)] + const. (17) 

The first two terms are expectations that involve 0; the third term is constant with 
respect to A. The constant absorbs quantities that relate to the other hidden variables. 
Those quantities do not depend on q(/3 | A) because all variables are independent in 
the mean-field family. 



Equation [17] reproduces the full ELBO in Equation 11 The second term of Equation 17 



is the entropy of the global variational distribution. The first term derives from the 
expected log joint likelihood, where we use the chain rule to separate terms that 
depend on the variable (5 from terms that do not, 

E 9 [logp(/3, z, x)] = E,[logp(z, x)] + E 5 [logp(/3 | z, x)}. (18) 

The constant absorbs E, q [logp(z, x)}, leaving the expected log conditional E g [logp(/3 | z, x) 



Finally, we substitute the form of q(/3 | A) in Equation 14 to obtain the final expression 
for the ELBO as a function of A, 

£(A) = E q [r] 9 (z, x, a)] T V A a(A) - A T V A a(A) + a(A) + const. (19) 

In the first term of this expression, we used the exponential family identity that the 
expectation of the sufficient statistics is the gradient of the log normalizer, E A [t(/3)] = 
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Vao(A). The constant has further absorbed the expected log normalizer of the 
conditional distribution —~E q [a(j]g(z,x,a))]. 



Equation 19 simplifies the ELBO as a function of the global variational parameter. 



To derive the coordinate ascent update, we take its gradient, 

V A £ = VX(A)(E^(aj, z, a)] - A). (20) 

We can zero this gradient by setting 

X = E4 % (x,z,a)]. (21) 

This sets the global variational parameter equal to the expected natural parameter 
of its complete conditional distribution. Implementing this update, holding all other 
variational parameters fixed, optimizes the ELBO over A. Notice that the mean-field 
assumption plays an important role. The update is the expected conditional parameter 
E, ( f > [i] g (x, z, a)], which is an expectation of a function of the other random variables 
and observations. Thanks to the mean-field assumption, this expectation is only a 
function of the local variational parameters and does not depend on Af 



The gradient for each local parameter <p n j is nearly identical to the global case, 

V 0nj £ = Vj Bji a/ J -(^ nj -)(E A ,^ Bi _ J .[^(xn, Zn-j, P)} - 4>n,j)- (22) 
It equals zero when 

4> ntj = Ex^^lrjeixn, z n _j, (3)]. (23) 

Mirroring the global update, this expectation does not depend on <p n j. However, while 
the global update depends on all the local variational parameters, this update only 
depends on the global parameters and the other parameters associated with the nth 
context. 



The updates in Equations 21 and 23 form the algorithm for coordinate ascent variational 
inference, iterating between updating each local parameter and the global parameters. 
The full algorithm is in Figure [2j It is guaranteed to find a stationary point of the 
ELBO. It is the "classical" variational inference algorithm, used in many settings. 

As an aside, these updates reveal a connection between mean-field variational inference 



and Gibbs sampling (Gelfand and Smith, 1990). In Gibbs sampling, we iteratively 



sample from each complete conditional. In variational inference, we take variational 
expectations of the natural parameters of the same distributions. 



6. Computing this expectation for specific models is easy for directed graphical models with tractable 
complete conditionals. The details use the relationship between the sufficient statistics and log- 
normalizers. In Section [3j we show that these updates are tractable for many topic models. 
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1: Initialize A^ randomly. 
2: repeat 

3: for each local variational parameter <p n> j do 

4: Update 4> nd , (fa'j = E A(t -i) [r]e,j{x n , z n -j, /?)]. 
5: end for 

6: Update the global variational parameters, A**) = E^t) \jf] g {zi.^ 1 x\;n)\- 
7: until the ELBO converges 



Figure 2: Coordinate ascent mean-field variational inference. 



The local steps (Steps 3 and 4 in Figure [2]) are trivially parallelizable. The data 
can be distributed across many machines and the local variational updates can be 
implemented in parallel. These results can then be aggregated in Step 6 to find the 
new global variational parameters. 

However, the local steps also reveal an inefficiency in the algorithm. The algorithm 
begins by initializing the global parameters A randomly — the initial value of A does 
not reflect any regularity in the data. But before completing even one iteration, the 
algorithm must analyze every data point using these initial (random) values. This 
is wasteful, especially if we expect that we can learn something about the global 
variational parameters from only a subset of the data. Further, if the data are "infinite", 
i.e., if they represent a data source where information arrives in a constant stream, 
then this algorithm can never complete even one iteration. 

We will solve this problem with stochastic optimization. With stochastic optimization 
we can handle massive and streaming data sets, continually improving our estimates 
of the global variational parameters as we analyze more observations. The efficiency 
of our stochastic optimization algorithm hinges on using the natural gradients of the 
variational objective, and on the relationship between the natural gradients and the 
coordinate updates that we derived here. We will next discuss natural gradients and 
their role in mean- field variational inference. 



2.3 The natural gradient of the ELBO 



The natural gradient of a function accounts for the geometry of its parameter space, 
using a Riemannian metric to adjust the direction of the traditional gradient. Amari 



(1998) discusses natural gradients for maximum-likelihood estimation, which give 
faster convergence than standard gradients. In this section we describe Riemannian 
metrics for probability distributions and the natural gradient of the ELBO. 
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Gradients and probability distributions. The classical gradient method for 
maximization tries to find a maximum of a function /(A) by taking a series of steps of 
size p in the direction of the gradient, 

A (t+i) = aW+pVa/(A w ). (24) 

The gradient points in the direction of steepest ascent. That is, the gradient Va/(A) 
points in the same direction as 

argmax/(A + dX) subject to \ \d\\\ 2 < e 2 (25) 



for sufficiently small e, assuming that / is differentiable at A. In words, Equation 25 
implies that if we could only move a tiny distance e away from A then we should 
move in the direction of the gradient. Initially this seems reasonable, but there is a 
complication. The gradient direction implicitly depends on the Euclidean distance 
metric associated with the space in which A lives. There is no reason to think that 
the Euclidean metric captures an inherently meaningful notion of distance between 
settings of A; for example, it is sensitive to rescalings of the elements of A, a sensitivity 
that leads to moves in suboptimal directions. 

The problem with Euclidean distance is especially clear in our setting, where we are 
trying to optimize the ELBO with respect to the parameter of a probability distribution. 
For the moment, we focus on the variational distribution of the global variable q(f3\X). 
When optimizing over a probability distribution, the Euclidean distance between two 
parameter vectors A and A' is often not the best way of measuring the dissimilarity of 
the distributions q(/3 | A) and q(/3 | A'). For example, consider q((3) to be a univariate 
normal and A to be the mean \i and scale a of that distribution. The distributions 
A/"(0, 10000) and Af(10, 10000) are almost indistinguishable, and the Euclidean distance 
between their parameter vectors is 10. By contrast, the distributions J\f(Q, 0.01) and 
jV(0.1, 0.01) barely overlap, but this is not reflected in the Euclidean distance between 
their parameter vectors, which is only 0.1. 

Natural gradients and probability distributions. A natural measure of dissim- 
ilarity between probability distributions is the symmetrized KL divergence 



D%£(\,\')=E X log^ +E A , log^i . (26) 



«(0|A') 



«(/8[ A) 



Symmetrized KL depends on the distributions themselves, rather than on how they 
are parameterized; it is invariant to parameter transformations. 

With distances defined using symmetrized KL, we find the direction of steepest ascent 
in the same way as for gradient methods, 

argmax/(A + dX) subject to _D^™(A, A + dX) < e. (27) 

d\ 
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As e — > 0, the solution to this problem points in the same direction as the natural 
gradient. While the Euclidean gradient points in the direction of steepest ascent in 
Euclidean space, the natural gradient points in the direction of steepest ascent in the 
Riemannian space, i.e., the space where local distance is defined by KL divergence 
rather than the L 2 norm. 



We manage the more complicated constraint in Equation 27 with a Riemannian 
metric G(X). This metric defines linear transformations of A under which the squared 
Euclidean distance between A and a nearby vector A + dX is the KL between q(/3\X) 
and q((3\X + dX), 

dX T G(X)dX =D%™{X,X + dX), (28) 

and note that the transformation can be a function of A. For a distance measure (such 
as symmetrized KL) and its corresponding metric G(X), we find the natural gradient 



by premultiplying the gradient by G X (A) (Amari, 1998), 



V A /(A)^G(A)-V A /(A). 



(29) 



When the distance is the symmetrized KL divergence of Equation 26 , the Riemannian 



metric is the Fisher information matrix G (Amari, 1982, Kullback and Leibler l 1951), 



G(X) = IE a [(V A log g09 | A))(V A logg(/3 | A)) T ] 



(30) 



We can show that Equation 30 satisfies Equation l28] by approximating logg(/3 | A + <iA) 



using a first-order Taylor approximation of log q about A and plugging the result into 



Equation 26 When q(/3\ X) is in the exponential family (Equation 14) the metric is 



the second derivative of the log normalizer, 



G(X) = E A [(V A logp(/3 | A))(V A logp(/3 | A)) T ] 

= e a [(*(/?) - E X [t(p)])W) - Mm]V] 

= Vja(A). 



(31) 



This is due to the exponential family identity that the Hessian of the log normalizer 
function a with respect to the natural parameter A is the covariance matrix of the 
sufficient statistic vector t{0). 



Natural gradients and mean field variational inference. We now return to 
variational inference, and compute the natural gradient of the ELBO with respect 
to the variational parameters. Researchers have used the natural gradient in varia- 



tional inference for nonlinear state space models (Honkela et al. , 2008) and Bayesian 
mixtures flSato[ |2001[ )f| 



7. Our work here — using the natural gradient in a stochastic optimization algorithm — is closest to 



that of Sato (2001 (, though we develop the algorithm via a different path and that paper does 



not address models for which the joint conditional p(z n \/3, x n ) is not tractable. 
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Consider the global variational parameter A. The gradient of the ELBO with respect 



to A is in Equation 20 Since A is a natural parameter to an exponential family 



distribution, the Fisher metric defined by q(/3) is Vfa(A). Note that the Fisher metric 



is the first term in Equation 20 We premultiply the gradient by the inverse Fisher 
information to find the natural gradient. This reveals that the natural gradient has 
the following simple form, 

Va£ = E rt> [r] g (x, z, a)] — A. (32) 
An analogous computation goes through for the local variational parameters, 

V«7i n , 3 £ = Ex,4> n ,-j[Vi{ x n, Zn-j, P)] - <Pnj- (33) 

The natural gradients are closely related to the coordinate ascent updates. Consider 
a full set of variational parameters A and (p. We can compute the natural gradient 



by computing the coordinate updates in parallel, i.e., the first terms in Equation 32 



and Equation 33, and subtracting the current setting of the parameters. The classical 
coordinate ascent algorithm of Figure [2] was not derived as a natural gradient algorithm, 
but it can be interpreted as a projected natural gradient algorithm; updating any 
single parameter A or <fi n by taking a natural gradient step of length 1 is equivalent 



to performing the update in Equation 21 or Equation 23 



We have motivated natural gradients by mathematical reasoning around the geometry 
of the parameter space. Even more important in this setting, however, is that natural 
gradients are easier to compute than classical gradients, and this opens the door 
to efficient gradient-based algorithms for variational inference. They are easier to 
compute because premultiplying by the Fisher information matrix — which we must 
do to compute the classical gradient but which is not needed to compute the natural 
gradient — is prohibitively expensive for variational parameters with many components. 
As we will see in the next section, being able to efficiently compute gradients lets us 
develop scalable variational inference algorithms. 



2.4 Stochastic variational inference 

We have reviewed mean-field variational inference in models with exponential family 
conditionals and showed that the natural gradient of the variational objective function 
is easy to compute. We now discuss stochastic optimization, optimization with repeated 
noisy estimates of the gradient, which is the basis for scalable variational inference. 

Stochastic optimization. Stochastic optimization algorithms follow noisy esti- 
mates of the gradient with a decreasing step size. Noisy estimates of a gradient are 
often cheaper to compute than the true gradient, and following such estimates can 
allow algorithms to escape shallow local optima of complex objective functions. In 
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statistical estimation problems, including variational inference of the global parameters, 
the gradient can be written as a sum of terms for each data point and we can compute 
a fast noisy approximation by subsampling the data. With certain conditions on the 



step-size schedule, these algorithms provably converge to an optimum (Spall, 2003 



Robbins and Monro 


Bottou 


(2003 


) gives 



1951 


)• 


Spall 


(2003 



Spall (2003) gives an overview of stochastic optimization; 



Consider an objective function /(A) and a random function B(X) that has expectation 
equal to the gradient E, q [B(X)] = Va/(A). The stochastic gradient algorithm, which is 
a type of stochastic optimization, optimizes /(A) by iteratively following realizations 
of B(X). At iteration t, the update for A is 



(34) 



where bt is an independent draw from the noisy gradient B. If the sequence of step 
sizes p t satisfies 



OG 



(35) 



then A* will converge to the optimal A* (if / is convex) or a local optimum of / (if 
not convex)]^ The same results apply if we premultiply the noisy gradients bt by 
a sequence of positive-definite matrices G^ 1 (whose eigenvalues are bounded). The 
resulting algorithm is 

A* = A* -1 + p t Gj 1 b t {\ t ~ 1 ). (36) 

As our notation suggests, we will use the Fisher metric for Gt, replacing stochastic 
Euclidean gradients with stochastic natural gradients. 



Stochastic variational inference. The coordinate ascent algorithm in Figure [2] is 
inefficient for large data sets because we must optimize the local variational parameters 
for each data point before re-estimating the global variational parameters. Stochastic 
variational inference uses stochastic optimization to fit the global variational parame- 
ters. We repeatedly subsample the data to form noisy estimates of the natural gradient 
of the ELBO, and we follow these estimates with a decreasing step-size. 

The resulting algorithm is in Figure |3j At each iteration we have a current setting of 
the global variational parameters. We repeat the following steps: 



1. Sample a data point from the set; compute its local variational parameters. 



To find a local optimum, / must be three-times differentiable and meet a few other technical 
requirements (Bottou 2003). The variational objective satisfies these criteria. 
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2. Form intermediate global variational parameters, as though we were running 
classical coordinate ascent and the sampled data point were repeated N times 
to form the collection. 

3. Update the global variational parameters to be a weighted average of the 
intermediate parameters and their current setting. 



We now show that this algorithm is stochastic natural gradient ascent of the global 
variational parameters. 

Our goal is to find a setting of the global variational parameters A that maximizes the 
ELBO. We define C(X) to be the ELBO when A is held fixed and the local variational 
parameters are set to their optimal value, 

£(A) = max£(A,0). (37) 

Holding the global parameters A fixed, let 0(A) be all the optimal local parameters. 

We can compute the (natural) gradient of £(A) by first finding the corresponding 
optimal local parameters 0(A) and then computing the (natural) gradient of £(A, 0(A)), 
holding 0(A) fixed. The reason is that the gradient of £(A) with respect to A is the 
same as the gradient of the two-parameter ELBO C(X, 0(A)), 

V A £(A) = V A £(A, 0(A)) + (V a 0(A)) t V^£(A, 0(A)) (38) 
= V A £(A,0(A)), (39) 

where Va0(A) is the Jacobian of 0(A) and we use the fact that the gradient of C(X, 0) 
with respect to is zero at 0(A). 

Stochastic variational inference optimizes the maximized ELBO £(A) by subsampling 
the data to form noisy estimates of the natural gradient. First, we decompose £(A) 
into a global term and a sum of local terms, 

N 

£(A)=Ejlogp(/3)]-Ejlog g (/3)]+Vmax(Ejlogp(z n ,xJ/3)]-E ? [logg(z n )]). (40) 

— : <?-n 
n=l 



Now consider a variable that chooses an index of the data uniformly at random, 
/ ~ Unif(l, . . . , N). Define £/(A) to be the following random function of the variational 
parameters, 

£/(A) = E g [log p(/3)] -Ejlogg(/3)] + iVmax (E g [logp(2j, X/ | (3) - E 9 [logg(z/)]) . (41) 



The expectation of Cj is equal to the objective in Equation [40} Therefore, the natural 
gradient of £j with respect to each global variational parameter A is a noisy but 
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unbiased estimate of the natural gradient of the variational objective. This process — 
sampling a data point and then computing the natural gradient of Cj — will provide 
cheaply computed noisy gradients for stochastic optimization. 

We now compute the noisy gradient. Suppose we have sampled the ith data point. 



Notice that Equation [41] is equivalent to the full objective of Equation [40] where the 
ith data point is observed 7V times. Thus the natural gradient of Equation [41] — which 
is a noisy natural gradient of the ELBO — can be found using Equation [32] 

VX; = 1^(4^^, a)] -A, (42) 

where {x\ , z± } are a data set formed by N replicates of observation x n and hidden 
variables z n . 

We compute this expression in more detail. Recall the complete conditional rj g (x, z, a) 
from Equation [6] From this equation, we can compute the conditional natural 
parameter for the global parameter given N replicates of x n , 

rj g {z\ N \xf\ a) = a + N(t(z n , x n ), 1). (43) 



Substituting this into the natural gradient of the ELBO from Equation [32] gives the 
noisy natural gradient, 

V A A = a + jm MX) [(t(zi, Xi ), 1)] - A, (44) 

where (f>i(X) = argmax^,. £j. While the full natural gradient would use the local 
variational parameters for the whole data set, the noisy natural gradient only considers 
the local parameters for one randomly sampled data point. These noisy gradients are 
cheaper to compute. 

Finally, we use these natural gradients in a Robbins-Monro algorithm to optimize the 
ELBO. At each iteration we update the global variational parameter with a noisy 
gradient. The step-size at iteration t is p t . The update is 

A W = A^ 1 ) + p 4 (E, 7(A(t _ 1)) [r /g (4 7V ),xf ) ,«)]-A^ 1 )) (45) 
= (1 -fH)\W +p t E MXit - 1) [ Vg (zf\x ( i N >,«)]. (46) 

This is a weighted average of the previous estimate of A and the estimate of A that we 
would obtain if the sampled data point was replicated N times. 

Figure [3] presents the full algorithm. At each iteration, the algorithm has an estimate 
of the global variational parameter A 1 -' -1 ). It samples a single data point from the 
data and computes the "single data-point" optimal global variational parameter, i.e., 
the next value of A if the data set contained N replicates of the sampled point. We 
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1: Initialize A^ randomly. 

2: Set the step-size schedule p t appropriately. 

3: repeat 

4: Sample a data point x t uniformly from the data set. 
5: Compute its local variational parameter, 

( j> = E x{t - 1) [ % (x i t N \z^)]. 

6: Compute intermediate global parameters as though x t is replicated N times, 

\ = E,[ % (z { t N \x ( t N) )]. 

7: Update the current estimate of the global variational parameters, 

AW = (l- Pt )A (t - 1) +p t A. 

8: until forever 

Figure 3: Stochastic variational inference. 
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emphasize that this is much cheaper than iteratively optimizing the local variational 
parameters for each data point (as in the algorithm of Figure [2]). 



Finally, it sets the new estimate of the global variational parameter to be a weighted 
average of the previous estimate and the single-data-point optimal. We set the step-size 
at iteration t as follows, 

Pt = (t + r)-\ (47) 



This satisfies the conditions in Equation 35 The forgetting rate k e (0.5, 1] controls 



how quickly old information is forgotten; the delay r > downweights early iterations. 
In Section [4] we fix the delay to be one and explore a variety of forgetting rates. Note 
that this is just one way to parameterize the learning rate. As long as the step size 



conditions in Equation [35] are satisfied, this iterative algorithm converges to a local 
optimum of the ELBO. 



2.5 Extensions 

We now describe two extensions of the basic stochastic inference algorithm in Figure |3] 
the use of multiple samples ("minibatches") to improve the algorithm's stability, and 
methods for hyperparameter estimation. 



Minibatches. So far, we have considered stochastic variational inference algorithms 
where only one observation x n is analyzed at a time. Many stochastic optimization 



algorithms benefit from "minibatches," i.e., several examples at a time (Bottou and 



Bousquet 


•21 ION; 


Liang et al. 


2009 


Mairal et al. 


2010 



inference, we can sample a set of S examples at each iteration x tj x-s (with or without 
replacement), compute the local variational parameters S (A^ -1 ^) for each data point, 
compute the intermediate global parameters X s for each data point Xt yS , and finally 
average the A s variables in the update 



A (t) 



Pt 



t-i 



Pi 
s 



(48) 



The stochastic natural gradients associated with each point x s have expected value 
equal to the gradient. Therefore, the average of these stochastic natural gradients has 
the same expectation and the algorithm remains valid. 

There are two reasons to use minibatches. The first reason is to amortize any 
computational expenses associated with updating the global parameters across more 
data points; for example, if the expected sufficient statistics of /3 are expensive to 
compute, using minibatches allows us to incur that expense less frequently. The 
second reason is that it may help the algorithm to find better local optima. Although 
stochastic variational inference is guaranteed to converge to a local optimum, it may 
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be that taking large steps on the basis of very few data points leads the algorithm to 
a poor local optimum. As we will see in Section |4| using more of the data per update 
can help the algorithm avoid such pathological local optima. 

Empirical Bayes estimation of hyperparameters. In some cases one may want 
to both estimate the posterior of the hidden random variables (5 and z and obtain a 
point estimate of the values of the hyperparameters a. One approach to fitting a is 



maximum marginal likelihood, also known as empirical Bayes (Maritz and Lwin 1989). 
Ideally, the hyperparameters a would be set to maximize the marginal likelihood 
of the data p(x \ a). Since we cannot compute p(x \ a) exactly, an alternative is to 
instead maximize the lower bound C over a. In the non-stochastic setting, a can be 
optimized by interleaving the coordinate ascent updates in Figure [2] with an update 
for a that increases the ELBO with the variational parameters A and 0i : tv held fixed. 
This is called "variational expectation-maximization." 

In the stochastic setting, we update a simultaneously with A. We can take a step in 
the direction of the gradient of the replicated ELBO Ct with respect to a, scaled by 
the step-size p t , 



+ p 4 V a £ t (A ( ^ 1) ,0,«). 



Here A*-* ^ are the global parameters from the previous iteration and <p are the 
optimized local parameters for the currently sampled data point. 



3. Stochastic Variational Inference in Topic Models 



We derived stochastic variational inference, a scalable inference algorithm that can be 
applied to a large class of hierarchical Bayesian models. In this section we show how 
to use the general algorithm of Section [2] to derive stochastic variational inference for 
two probabilistic topic models: latent Dirichlet allocation (LDA) ( |Blei et ah 2003[ ) 
and its Bayesian nonparametric counterpart, the hierarchical Dirichlet process (HDP) 



topic model (Teh et al. 2006a). 



Topic models are probabilistic models of document collections that use latent variables 
to encode recurring patterns of word use (Blei, 2012). Topic modeling algorithms 



are inference algorithms; they uncover a set of patterns that pervade a collection 
and represent each document according to how it exhibits them. (These patterns 
tend to be thematically coherent, which is why the models are called "topic models.") 
Topic models are used for both descriptive tasks, such as to build thematic navigators 
of large collections of documents, and for predictive tasks, such as to aid document 
classification. Topic models have been extended and applied in many domains. 

Topic models assume that the words of each document arise from a mixture of multi- 
nomials. Across a collection, the documents share the same mixture components 
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(called topics). Each document, however, is associated with its own mixture pro- 
portions (called topic proportions). In this way, topic models represent documents 
heterogenously — the documents share the same set of topics, but each exhibits them to 
a different degree. For example, a document about sports and health will be associated 
with the sports and health topics; a document about sports and business will be 
associated with the sports and business topics. They both share the sports topic, but 
each combines sports with a different topic. More generally, this kind of model is 
known as a mixed-membership model. 

The central computational problem in topic modeling is posterior inference: Given 
a collection of documents, what are the topics that it exhibits and how does each 
document exhibit them? Most posterior inference algorithms for topic models are 
based on Markov chain Monte Carlo (MCMC) sampling or variational inference. In 
practical applications of topic models, scale is important. Topic models promise 
an unsupervised approach to organizing large collections of text (and, with simple 
adaptations, images, sound, and other data). Thus they are a good testbed for 
stochastic variational inference. 

This section illustrates how to use the results from Section [2] to develop algorithms 
for specific models. We will derive the algorithms in several steps: (1) we specify the 
model assumptions; (2) we derive the complete conditional distributions of the latent 
variables; (3) we form the mean-field variational family; (4) we derive the corresponding 
stochastic inference algorithm. In Section [4| we will report our empirical study of 
stochastic variational inference with these models. 



3.1 Notation 



We follow the notation of Blei et al. (2003) 



Observations are words, organized into documents. The nth word in the dth 
document is Wd,n- Each word is an element in a fixed vocabulary of V terms. 

A topic Pk is a distribution over the vocabulary. Each topic is a point on the 
V — 1 simplex, a positive vector of length V that sums to one. In LDA there 
are K topics; in the HDP topic model there are an infinite number of topics. 

Each document in the collection is associated with a vector of topic proportions 
6d, which is a distribution over topics. In LDA is a point on the K—l simplex. 
In the HDP topic model, 9d is a point on the infinite simplex. (We give details 



about this below in Section 3.3 



Each word in each document is assumed to have been drawn from a topic. The 
topic assignment Zd n indexes the topic from which wj n is drawn. 
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The only observed variables are the words of the documents, 
proportions, and topic assignments are latent variables. 



The topics, topic 



3.2 Latent Dirichlet allocation 

LDA is the simplest topic model. It assumes that each document exhibits K topics 
with different proportions. The generative process is 

1. Draw topics /3& ~ Diry (77) for k £ {1, . . . , K}. 

2. For each document d 6 {1, . . . , D}: 

(a) Draw topic proportions 9 ~ Dir^a). 

(b) For each word w £ {1, . . . , iV}: 

i. Draw topic assignment Zd, n ~ Mult (6^). 

ii. Draw word Wd, n ~ Mult(/3 2dn ). 



Figure [4] illustrates LDA as a graphical modelj^] 

In LDA, each document exhibits the same shared topics but with different proportions. 
We have assumed exchangeable Dirichlet priors, which take a scalar parameter. For 
example, the prior on topics is Duy (77). It produces V- vectors on the V— 1 simplex from 
a Dirichlet distribution with all parameters set to 77. The prior on topic proportions 
is analagous, but produces if-vectors on the K — 1 simplex. (We note that Blei 



et al. (2003) and Wallach et al. (2009) found improved empirical performance with 



non-exchangeable Dirichlet priors.) 

LDA models an observed collection of documents w = Wi-_d, where each Wd is a 
collection of words Wd,i-.N- Analyzing the documents amounts to posterior inference of 
p((3, 0,z\w). Conditioned on the documents, the posterior distribution captures the 
topics that describe them = fii-.K, how each document exhibits those topics 6 = 6i-.d, 
and which topics each word was assigned to z = Zi : d,i-.n- We can use the posterior to 
explore large collections of documents. 



The posterior is intractable to compute (Blei et al. , 2003). Approximating the posterior 
in LDA is a central computational problem for topic modeling. Researchers have 



developed many methods, including Markov chain Monte Carlo methods (Griffiths and 



Steyvers, 2004), expectation propagation (Minka and Lafferty, 2002), and variational 
inference (Blei et al. 2003, Teh et al. , 2006b, Asuncion et al. 2009). Here we use 



9. Each document d may not have the same number of words N. However, to keep the notation 
simple, we suppress TV's dependence on the document index d. 
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the results of Section [2] to develop stochastic inference for LDA. This scales the 



10 



original variational algorithm for LDA in Blei et al. (2003) to massive collections of 
documents, 



Indicator vectors and Dirichlet distributions. Before deriving the algorithm, 
we discuss two mathematical details. 

First, we represent categorical variables like the topic assignments z^ n an d observed 
words Wd, n with indicator vectors. An indicator vector is a binary vector with a 
single one. For example, the topic assignment z^ n can take on one of K values (one 
for each topic). Thus, it is represented as a K- vector with a one in the component 
corresponding to the value of the variable: if z k dn = 1 then the nth word is assigned to 
the kth topic. 

Second, we review the Dirichlet distribution. As we described above, a i^-dimensional 
Dirichlet is a distribution on the K — 1 simplex, i.e., positive vectors over K elements 
that sum to one. It is parameterized by a positive i^-vector 7, 



rr=i r( 7 i 



i=l 



where T(-) is the Gamma function, which is a real- valued generalization of the factorial 
function. The expection of the Dirichlet is its normalized parameter, 

E[0* 1 7] = =Jr— • (50) 

The expectation of its log uses which is the first derivative of the log Gamma 
function, 

E[log^| 7 ] = *( 7 fc) - * (J2li%) ■ (51) 

This can be derived by putting the Dirichlet in its exponential family form, noticing 
that log#fc are the sufficient statistics, and taking the derivative of the log-normalizer 
to find their expectation. 

Complete conditionals and variational distributions. We specify the global 
and local variables of LDA to place it in the stochastic variational inference setting of 
Section [2] In topic modeling, the local context is a document. The local observations 
are its observed words w^i-.n- The local hidden variables are the topic proportions 9d 
and the topic assignments z^i-n- The global hidden variables are the topics f3\-K- 



10. The algorithm we present was originally developed in Hoffman et al. ( 2010a[ |, which is a special 



case of the stochastic variational inference algorithm we developed in Section |2] 
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Recall from Section [2] that the complete conditional is the conditional distribution of a 
variable given all of the other variables, hidden and observed. In mean-field variational 
inference, the variational distributions of each variable are in the same family as the 
complete conditional. 

We begin with the topic assignment z^ n - The complete conditional of the topic 
assignment is a multinomial, 

p{zd,i = k | 9 d ,/3 1:K ,w d)i ) oc exp{log6> d)A; + log/3 k , Wd J. (52) 

Thus its variational distribution is a multinomial q(zd,i) = Mult(0rf j), where the 
variational parameter <pd,i is a distribution over the K topics. Per the mean-field 
assumptions, each observed word is endowed with a different variational distribution for 
its topic assignment. These variational distributions can capture that some observed 
words are likely to be from from one topic; others are likely to be from another. 

The complete conditional of the topic proportions is a posterior Dirichlet, 

P(0d | /3l:K, Zd) = Dh (a + J2n=l Z d,n) • ( 53 ) 

Since z^ n is an indicator vector, the parameter to this Dirichlet is the usual posterior 
parameter — for each topic k, its posterior Dirichlet parameter is the sum of the 
hyperparameter a and the number of words assigned to topic k. Note that while a 
is a scalar parameter to an exchangeable Dirichlet, the posterior Dirichlet is a full 
distribution with K different parameter values. 

With this conditional, the variational distribution of the topic proportions is also 
Dirichlet q{9d) = Dir(7d), where 7^ is a .fT-vector Dirichlet parameter. As for the topic 
assignments, we emphasize that there is a different variational Dirichlet parameter 
for each document. These variational distributions capture that different documents 
reflect the topics with different proportions. 

Notice that these are local hidden variables. The complete conditionals only depended 
on other variables in the local context (i.e., the document) and the global variables. 

Finally, the complete conditional for the topic (3 k is also a posterior Dirichlet, 

p(P k \Z,W) = Dh [r] + J2d=l En=l Z d,n W d,n) ■ (54) 

For a particular term and topic k, its posterior Dirichlet parameter is the sum of the 
hyperparameter rj and the number of times that the term was assigned to topic k. 
This is a global variable — its complete conditional depends on the words and topic 
assignments of the entire collection. 

The variational distribution for each topic is a V-dimensional Dirichlet, 

q(/3 k ) = Dir(A fc ). (55) 
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As we will see in the next section, the traditional variational inference algorithm for 
LDA is inefficient with large collections of documents. The root of this inefficiency is 



the update for the topic parameter A&, which (from Equation 54) requires summing 
over variational parameters for every word in the collection. 

Batch variational inference. With the complete conditionals in hand, we now 
derive the coordinate ascent variational inference algorithm, i.e., batch inference. We 
form each coordinate update by taking the expectation of the natural parameter of the 
complete conditional. This is the stepping stone to stochastic variational inference. 

The variational parameters are the global per-topic Dirichlets Ai : j<r, local per-document 
Dirichlets Jud: and local per-word multinomials 4>i-.di:N- Coordinate ascent variational 



inference iterates between updating all of the local variational parameters (Equation 23 ) 



and updating the global variational parameters (Equation 21). 



We update each document's local variational in a local coordinate ascent routine, 
iterating between updating each word's topic assignment and the per-document topic 
proportions, 

d , n ex exp{tf( Td ) + tt(A. lttd J forne{l,...,iV} (56) 

id = « + e!=iV ( 57 ) 

These updates derive from taking the expectations of the natural parameters of the 



complete conditionals in Equation 52 and Equation 53 (We then map back to the 



usual parameterizations of the multinomial and Dirichlet.) For the update on the 



topic assignment, we have used the Dirichlet expectations in Equation 51 For the 



update on the topic proportions, we have used that the expectation of an indicator is 
its probability, E q [zJJ = (p k dn . 

After finding variational parameters for each document, we update the variational 
Dirichlet for each topic, 

Afc = V + ES=l En=l <fi,n w d,n- (58) 

This update depends on the variational parameters for every document. 

Batch inference is inefficient for large collections of documents. Before updating the 
topics Xi-K, we compute the local variational parameters for every document. This is 
particularly wasteful in the beginning of the algorithm when, before completing the 
first iteration, we must analyze every document with randomly initialized topics. 

Stochastic variational inference Stochastic variational inference provides a scal- 
able method for approximate posterior inference in LDA. The global variational param- 
eters are the topic Dirichlets A^; the local variational parameters are the per-document 
topic proportion Dirichlets 7^ and the per-word topic assignment multinomials 
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1: Initialize A^ randomly. 

2: Set the step-size schedule pt appropriately. 

3: repeat 

4: Sample a document Wd uniformly from the data set. 

5: Initialize 7^ = 1, for k G {1, . . . , K}. 

6: repeat 

7: For n £ {1, ... , N} set 

<^ n oc exp {E[log 9 dtk ] + E[log } , fc G {1, . . . , if }. 

8: Set J d = OC + J2n <^,n- 

9: until local parameters <pd,n and 7^ converge. 
10: For k G {1, . . . , if} set intermediate topics 

N 

X k = T] + D 4>\ n W d ,n- 
n=l 

11: Set AW = (1 -p t )A(*- 1 ) + p t A. 
12: until forever 



Figure 5: Stochastic variational inference for LDA. The relevant expectations for each 
update are found in Figure |4} 
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We follow the general algorithm of Figure [3J Let A^ be the topics at iteration t. 
At each iteration we sample a document d from the collection. In the local phase, 
we compute optimal variational parameters by iterating between updating the per- 



MA:N 



document topic Dirichlet 7^ (Equation 57) and per- word topic assignment multinomial 



(Equation 56). This is the same subroutine as in batch inference, though here 



we only analyze one subsampled document. 

In the global phase we use these fitted local variational parameters to form intermediate 
topics, 

h = r] + Dj:Z =1 <t> k d>n w d>n . (59) 



This comes from Equation 58 for a corpus containing D replicates of document d. 
We then set the topics at the next iteration to be a weighted combination of the 
intermediate topics and current topics, 



A 



(t+i) 



[1 - Pt )\W + Pt \ k 



(60) 



The algorithm for stochastic variational inference for LDA is in Figure 5JF" 



Stochastic inference versus batch inference for LDA. Figure [6] illustrates 
the performance of 100-topic LDA on three large collections — Nature contains 350K 
documents, New York Times contains 1.8M documents, and Wikipedia contains 3.8M 
documents. (Section [4] describes the complete study, including the details of the 
performance measure and corpora.) We compare stochastic inference for LDA on the 
full collection to batch inference on a 100K document subset, for which batch inference 
can run in reasonable speed. We see that stochastic variational inference converges 
faster and to a better model. 



3.3 Bayesian nonparametric topic models with the HDP 

Stochastic inference for LDA lets us analyze large collections of documents. One 
limitation of LDA, however, is that the number of topics is fixed in advance. Typically, 



researchers find the "best" number of topics with cross-validation (Blei et al. 2003). 
However, for very large data (or streaming data) this approach is not practical. We 
address this issue with a Bayesian nonparametric topic model. 



11. This algorithm, as well as the algorithm for the HDP, specifies that we initialize the topics 
randomly. There are many ways to initialize the topics. We use a Gamma distribution, 

A fe ,„ - Gamma(l, 1) * D * 100/ (KV) + r). 

This relates to — but is not exactly the same as — having a corpus of size D with 100 words per 
document, and allocating those words randomly to different topics. 
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Figure 6: The per-word predictive log likelihood for a 100-topic LDA model on three 
large corpora. Stochastic variational inference on the full data converges 
faster and to a better place than batch variational inference on a reasonably 
sized subset. Section [4] gives the details of our empirical study. 



We derive stochastic variational inference for the Bayesian nonparametric variant of 
LDA, the hierarchical Dirichlet process (HDP) topic model. Like LDA, the HDP topic 
model is a mixed-membership model of text collections. However, the HDP assumes an 
"infinite" number of topics. Given a collection of documents, the posterior distribution 
of the hidden structure determines how many topics are needed to describe them. 
Further, the HDP is flexible in that it allows future data to exhibit new and previously 
unseen topics. 

More broadly, stochastic variational inference for the HDP topic model demonstrates 
the possibilities of stochastic inference in the context of Bayesian nonparametric 
statistics. Bayesian nonparametrics gives us a collection of flexible models — mixture 
models, mixed-membership models, factor models, and models with more complex 
structure — which grow and expand with data. Flexible and expanding models are 
particularly apt for analyzing large data sets, where searching for a specific latent 
structure (such as a number of topics or a tree structure of components) is prohibitive. 

This section is organized as follows. We first give some background on the Dirichlet 
process and its definition via Sethuraman's stick breaking construction, which is a 
distribution on the infinite simplex. We then show how to use this construction to form 
the HDP topic model and how to use stochastic variational inference to approximate 
the posterior p^] 



12. This algorithm first appeared in Wang et al. (2011 1. Here we place it in the more general context 



of Section [2] and relate it to stochastic inference for LDA 
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The stick-breaking construction of the Dirichlet process. Bayesian nonpara- 
metric (BNP) methods use distributions of distributions, placing flexible priors on the 
shape of the data-generating density function. BNP models draw a distribution from 
that prior and then independently draw data from that random distribution. Data 
analysis proceeds by evaluating the posterior distribution of the (random) distribution 
from which the data were drawn. Because of the flexible prior, that posterior can 
potentially have mass on a wide variety of distribution shapes. For a reviews of BNP 



and Blei (2011). 



methods, see the edited volume of Hjort et al. (2010) and the tutorial of Gershman 



The most common BNP prior is the Dirichlet process (DP). The Dirichlet process 
is parameterized by a base distribution Go and a scaling factor a. These are used 
to form a distribution of discrete distributions, i.e., those that place their mass on a 
countably infinite set of atoms. The locations of the atoms are independently drawn 
from the base distribution Go (which need not be discrete) and the closeness of the 
probabilities to Go is determined by the scaling factor a. When a is small, more mass 
is placed on fewer atoms, and the draw will likely look very different from Go; when 
a is large, the mass is spread around many atoms, and the draw will more closely 
resemble the base distribution. 

There are several representations of the Dirichlet process. For example, it is a 



normalized gamma process (Ferguson, 1973), and its marginalization gives the Chinese 



restaurant process (Pitman, 2002). We will focus on its definition via Sethuraman's 
stick breaking construction (Sethuraman, 1994). The stick- breaking construction 



explicitly defines the distribution of the probabilities that make up a random discrete 
distribution. It is the gateway to variational inference in Bayesian nonparametric 



models (Blei and Jordan, 2005) 



Let G ~ DP (a, Go) be drawn from a Dirichlet process prior. It is a discrete distribution 
with mass on an infinite set of atoms. Let (3k be the atoms in this distribution and 
be their corresponding probabilities. We can write G as 



G = ^<r h 8 h . 



(61) 



fc=i 



The atoms are drawn independently from Go- The stick- breaking construction specifies 
the distribution of their probabilities. 

The stick-breaking construction uses an infinite collection of beta-distributed random 
variables. Recall that the beta is a distribution on (0, 1) and define the following 
collection, 

u< ~ Beta(l, a) i € {1, 2, 3, . . .}. (62) 

These variables combine to form a point on the infinite simplex. Imagine a stick of 
unit length. Break off the proportion of the stick given by V\, call it o~i, and set it 
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aside. From the remainder (of length 1 — <Ji) break off the proportion given by v 2, call 
it cr 2 , and set it aside. The remainder of the stick is now 1 — a 2 — o~\ = (1 — Ui)(l — v%). 
Repeat this process for the infinite set of t>j. The resulting stick lengths <jj will sum to 
one. 

More formally, we define the function <7j to take the collection of realized V{ variables 
and to return the stick length of the ith component, 



(Ti. V 



(63) 



15=1(1 

and note that YliLi °"i( v ) = !• We call Vi the ith breaking proportion. 

Combining these steps, we form the distribution G according to the following process, 

A ~ G i 6 {1,2,3,...} 

vt ~ Beta(l,a) i G {1, 2,3,...} 

In the random distribution G the zth atom @i is an independent draw from Go and it 



has probability given by the ith stick length Uj(v). Sethuraman (1994) showed that 
the distribution of G is DP (a, Go). 

The most important property of G is the "clustering" property. Even though G places 
mass on a countably infinite set of atoms, N draws from G will tend to exhibit only a 
small number of them. (How many depends on the scalar a, as we described above.) 



Formally, this is most easily seen via other perspectives on the DP (Ferguson, 1973 



Blackwell and MacQueen, 1973, Pitman, 2002), though it can be seen intuitively with 



the stick-breaking construction. The intuition is that as a gets smaller more of the 
stick is absorbed in the first break locations because the breaking proportions are 
drawn from Beta(l,a). Thus, those atoms associated with the first breaks of ths 
stick will have larger mass in the distribution G, and that in turn encourages draws 
from the distribution to realize fewer individual atoms. In general, the first break 
locations tend to be larger than the later break locations. This property is called size 
biasedness. 



The HDP topic model. We now construct a Bayesian nonparametric topic model 
that has an "infinite" number of topics. The hierarchical Dirichlet process topic 
model (Teh et al. 2006a) is a two- level Dirichlet process. The base distribution 
H of the top-level DP is a symmetric Dirichlet over the vocabulary simplex — its 
atoms are topics. We draw once from this DP, Go ~ DP(w, H). In the second level, 
we use Go (drawn from the top-level DP) as a base measure to a document-level 
DP, Gd ~ DP(a,Go). We draw the words of each document from topics from Gd- 
The consequence of this two-level construction is that all documents share the same 
collection of topics but exhibit them with different proportions. 
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We construct the HDP topic model using a stick-breaking construction at each level — 
one at the document level and one at the corpus levelj^] The generative process of 
the HDP topic model is as follows. 

1. Draw an infinite number of topics, fa ~ Diry(^) for k G {1, 2, 3, . . .}. 

2. Draw corpus breaking proportions, Vk ~ Beta(l, u) for k E {1,2,3, .. .}. 

3. For each document d: 

(a) Draw document-level topic indices, c^i ~ Mult(<r(v)) for i E {1, 2, 3, . . .}. 

(b) Draw document breaking proportions, n^i ~ Beta(l, a) for i G {1, 2, 3, . . .}. 

(c) For each word n: 

i. Draw topic assignment Zd, n ~ Mult (cr (71^)). 

ii. Draw word w n , ~ Mult (/3 Cd z ). 



Figure [7] illustrates this process as a graphical model. 

In this construction, topics fa are drawn as in LDA (Step 1). Corpus-level breaking 
proportions v (Step 2) define a probability distribution on these topics, which indicates 
their relative prevalence in the corpus. At the document level, breaking proportions ir^i 
create a set of probabilities (Step 3b) and topic indices Cd,i, drawn from u{v), attach 
each document-level stick length to a topic (Step 3a). This creates a document-level 
distribution over topics, and words are then drawn as for LDA (Step 3c). 

The posterior distribution of the HDP topic model gives a mixed-membership decom- 
position of a corpus where the number of topics is unknown in advance and unbounded. 
However, it is not possible to compute the posterior. Approximate posterior inference 



for BNP models in general is an active field of research (Escobar and West, 1995, Neal 



2000 Blei and Jordan 2005 Teh et al. ( 2007). 



The main advantage of our construction is that it meets the conditions of Section [2j 
All the complete conditionals are in exponential families in closed form, and it neatly 
separates global variables from local variables. The global variables are topics and 
corpus-level breaking proportions; the local variables are document-level topic indices 
and breaking proportions. Following the same procedure as for LDA, we now derive 
stochastic variational inference for the HDP topic model. 



13. See the original HDP paper of Teh et al. (2006a I for other constructions of the HDP — the random 



measure construction, the construction by the Chinese restaurant franchise, and an alternative 



stick- breaking construction. This construction was alluded to by Fox et al. ( 2008 1 . It was formally 



proposed and first used for the HDP by Wang et al. (2011). 
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Complete conditionals and variational distributions. We form the complete 
conditional distributions of all variables in the HDP topic model. We begin with the 
latent indicator variables, 

p(zd,n = k\^d,/3i:K,w d>n ,c d ) oc exp{log <r k (ir d ) + Y%L 1 c d,k lo S Pi,w d ,„ }, (64) 
p(c d>i = k\v, p lsK , w d , z d ) oc exp{logo- fc (v) + J2n=l Z d,n lo g^ d ,„}- (65) 

Note the interaction between the two levels of latent indicators. In LDA the ith 
component of the topic proportions points to the ith topic. Here we must account for 
the topic index c d j, which is a random variable that points to one of the topics. 

This interaction between indicators is also seen in the conditionals for the topics, 

p(/3 k \z, c, w) = Dir (77 + J2d=l E*l C d,i En=l Z d,n W d,n) • (66) 

The innermost sum collects the sufficient statistics for words in the <ith document that 
are allocated to the ith local component index. However, these statistics are only kept 
when the ith topic index c d ^ points to the fcth global topic. 

The full conditionals for the breaking proportions follow those of a standard stick- 



breaking construction (Blei and Jordan, 2005) 



P (v k \c) = Beta (l + J2 d= i E*=i c h > w + E?=i E*=i E i>fc 4,i) > (67) 

p{K d>i \z d ) = Beta ^1 + J2n=l Z d,n > a + En=l J2j>i Z d,r) • (68) 

The complete conditionals for all the latent variables are all in the same family as 
their corresponding distributions in the generative process. Accordingly, we will 
define the variational distributions to be in the same family. However, the main 
difference between BNP models and parametric models is that BNP models contain 
an infinite number of hidden variables. These cannot be completely represented 
in the variational distribution as this would require optimizing an infinite number 
of variational parameters. We solve this problem by truncating the variational 



distribution (Blei and Jordan, 2005). At the corpus level, we truncate at K, fitting 
posteriors to K breaking points, K topics, and allowing the topic pointer variables to 
take on one of K values. At the document level we truncate at T, fitting T breaking 
proportions, T topic pointers, and letting the topic assignment variable take on one of 
T values. Thus the variational family is, 



q(/3, v, z, tt) = q(/3 k | \ k )q{v k \a k )\ ] [ Q q{c dii | ( d ,i)q(vd,i \ ld,i) JJ <l{zd. 



K \ / D T N 

' >,/.„) 

k=l / \d=l i=l ra=l 

We emphasize that this is not a finite model. With truncation levels set high enough, 
the variational posterior will use as many topics as the posterior needs, but will not 
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necessarily use all K topics to explain the observations. (If K is set too small then 
the truncated variational distribution will use all of the topics, but this problem can 
be easily diagnosed and corrected.) Further, a particular advantage of this two- level 
stick-breaking distribution is that the document truncation T can be much smaller 
than K. Though there may be hundreds of topics in a large corpus, we expect each 
document will only exhibit a small subset of them. 

Stochastic variational inference for HDP topic models. From the complete 
conditionals, batch variational inference proceeds by updating each variational pa- 
rameter using the expectation of its conditional distribution's natural parameter. In 
stochastic inference, we sample a data point, update its local parameters as for batch 
inference, and then update the global variables. 

To update the global topic parameters, we again form intermediate topics with the 
sampled document's optimized local parameters, 

\ k = r) + D Eli Eli EJ4,nK,„. (69) 

We then update the global variational parameters by taking a step in the direction of 
the stochastic natural gradient 

\^) = (1- Pt )\(t) +Pt \ k . (70) 
This parallels the update for LDA. 

The other global variables in the HDP are the corpus- level breaking proportions v k , 
each of which is associated with a set of beta parameters a k = (ar£\a^) f° r ^ s 
variational distribution. Using the same randomly selected document and optimized 
variational parameters as above, first construct the two-dimensional vector 

a k =(l + D Ef =1 E g [ C y , u) + D Ef =1 ££*+i Ejcy ) . (71) 

Then, update the parameters 

4 t+1) = (1 - Pt )af + p t a k . (72) 

Note that we use the truncations K and T. Figure [7] summarizes the complete 
conditionals, variational parameters, and relevant expectations for the full algorithm. 
Figure [8] gives the stochastic variational inference algorithm for the HDP topic model. 

Stochastic inference versus batch inference for the HDP. Figure [9] illustrates 
the performance of the HDP topic model on the same three large collections as in 
Figure [6j As for LDA, stochastic variational inference for the HDP converges faster 
and to a better model. 
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1: Initialize randomly. Set = 1 and b^ = cu. 
2: Set the step-size schedule p t appropriately. 
3: repeat 

4: Sample a document Wd uniformly from the data set. 
5: For j G {1,...,T} initialize 

Cli oc exp{En =1 E[log/3 Md J}, k G {1, . . . , K}. 

6: For n G {1, . . . , N} initialize 

<P\ n oc exp {Ef =1 Ci*E[tog/W]} , < e {1, . . . ,T}. 

7: repeat 

8: For ie{l,...,T} set 

7dJ = 1 + En=l $J,n> 

1d,i — a ' Z^n=l 2^j=i+l V^n 

Cj 4 oc exp {E[log(7,(V)] + Etx <„Epog)8*,^ n ]} , k € {1, . . . , AT}. 

9: For n G {1, . . . , N} set 

<t> d>n ex exp {E[log aifa)] + Ef=i CjiEflog^^j} , i G {1, . . . ,T}. 

10: until local parameters converge. 

11: For k G {1, . . . , iT} set intermediate topics 

X k>v = T] + D Yl=l (d,i Ell 0d,n W d,n 

a* = i + d EL & 

b k = u + D J2i =1 Hf= k+ i Cd,i- 

12: Set 

\^ = (1 - Pt )\^ + p t \ 
aV = (l-p t )a«-V+p t a 
= (1 - Pt )b^ + p t b. 

13: until forever 
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Figure 8: Stochastic variational inference for the HDP topic model. The corpus-level 
truncation is K; the document-level truncation as T. Relevant expectations 
are found in Figure [7} 
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Figure 9: The per- word predictive log likelihood for an HDP model on three large 
corpora. As for LDA, stochastic variational inference on the full data 
converges faster and to a better place than batch variational inference on a 
reasonably sized subset. Section [4] gives the details of our empirical study. 



4. Empirical Study 

We studied stochastic variational inference for latent Dirichlet allocation (LDA) and 
the hierarchical Dirichlet process (HDP) topic model. With these algorithms, we 
can apply and compare these models with very large collections of documents. We 
also investigated how the forgetting rate k and mini-batch size S influenced the 
algorithms. Finally, we compared stochastic variational inference to the traditional 
batch variational inference algorithm, 14 



Data. We evaluated our algorithms on three collections of documents. For each 
collection, we computed a vocabulary by removing stop words, rare words, and very 
frequent words. The data are as follows. 

• Nature: This collection contains 350,000 documents from the journal Nature 
(spanning the years 1869-2008). After processing, it contains 58M observed 
words from a vocabulary of 4,200 terms. 

• New York Times: This collection contains 1.8M documents from the New York 
Times (spanning the years 1987-2007). After processing, this data contains 
461M observed words from a vocabulary of 8,000 terms. 

14. We implemented all algorithms in Python using Numpy, making the implementations as similar 
as possible. All of our code is available on the web. 
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• Wikipedia: This collections contains 3.8M documents from Wikipedia. After 
processing, it contains 482M observed words from a vocabulary of 7,700 terms. 

For each collection, we set aside a test set of 10,000 documents for evaluating model 
fitness; these test sets were not given to the algorithms for training. 

Evaluating model fitness. We evaluate how well a model fits the data with the 
predictive distribution. We are given a corpus and estimate its topics. We then are 
given part of a test document, which we combine with the topics to form a predictive 
distribution of words. Under this predictive distribution, a better model will assign 
higher probability to the held-out words. 

In more detail, we divide each test document w into a set of observed words w Q b s and 
held-out words Wh Q . We approximate the posterior distribution of topics (5 from the 
training data T>, and then to use that approximate posterior to estimate the predictive 
distribution of words p(w \ w b s ,D). Finally, we evaluate the log probability of the 
words in u>ho under this distribution. 



This metric was used in Teh et al. (2007) and Asuncion et al. (2009). Unlike previous 



methods, like held-out perplexity in Blei et al. (2003), evaluating the predictive 



distribution avoids comparing bounds or forming approximations of the evaluation 
metric. It rewards a good predictive distribution, however it is computed. 

Operationally, we use the training data to compute variational Dirichlet parameters for 
the topics. We then use these parameters with the observed test wods w b s to compute 
the variational distribution of the topic proportions. Taking the inner product of the 
expected topics and the expected topic proportions gives the predictive distribution. 

To see this is a valid approximation, note the following for a i^-topic LDA model, 

p(w\V,w ohs ) = ^^{Ek=i^/3 k , w )p(e\w ohs ,/3)p((3\V) (73) 

~ / / (Ef=i 0kPk,J) mm ( 74 ) 

J (3 J 9 

= E q [6\w ohs ] T E q [(5., w \V]. (75) 

We have conditioned the expectations to make clear with which data the q distribution 
is fit. The metric independentally evaluates each held out word under this distribution. 
In the HDP, the reasoning is identical. The differences are that the topic proportions 
are computed via the two-level variational stick-breaking distribution and K is the 
truncation level of the approximate posterior. 
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Learning parameters. Stochastic variational inference introduces several param- 



eters in setting the learning rate schedule (see Equation 47). The forgetting rate 
re G (0.5, 1] controls how quickly old information is forgotten; the delay r > 
downweights early iterations; and the mini-batch size S is how many documents are 
subsampled and analyzed in each iteration. Although stochastic variational inference 
algorithm converges to a stationary point for any valid re, t, and S, the quality of this 
stationary point and the speed of convergence may depend on how these parameters 
are set. 



We set t = 1 and explored the following forgetting rates and minibatch sizes 

• Forgetting rate re G {0.5, 0.6, 0.7, 0.8, 0.9, 1.0} 

• Minibatch size S G {10, 50, 100, 500, 1000} 
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We periodically paused each run to compute predictive likelihoods from the test data. 

Results on LDA and HDP topic models. We studied LDA and the HDP. In 

LDA, we varied the number of topics K to be 25, 50, 100, 200 and 300; we set 
the Dirichlet hyperparameters a = 1/K. In the HDP, we set both concentration 
parameters 7 and a equal to 1; we set the top-level truncation K = 300 and the 
second level truncation T = 20. (Here T <C K because we do not expect documents to 
exhibit very many unique topics.) In both models, we set the topic Dirichlet parameter 



77 = 0.01. Figure 10 shows example topics from the HDP (on New York Times). 



Figure Y\_ gives the average predictive log likelihood for both models. We report the 
value for a forgetting rate re = 0.9 and a batch size of 500. Stochastic inference lets us 
perform a large scale comparison of these models. The HDP gives consistently better 
performance. For larger numbers of topics, LDA overfits the data. As the modeling 
assumptions promise, the HDP stays robust to overfitting, 
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We now turn to the sensitivity of stochastic inference to its learning parameters. First, 
we consider the HDP (the algorithm presented in Figure [8]). We fixed the batch size 
to 500 and explored the forgetting ratej^ Figure 12 shows the results on all three 



corpora. All three fits were sensitive to the forgetting rate; we see that a higher value 
(i.e., close to one) leads to convergence to a better optimum. 

15. We also explored various values of the delay r, but found that the algorithms were not sensitive. 
To make this presentaton simpler, we fixed r = 1 in our report of the empirical study. 

16. Though not illustrated, we note that using the traditional measure of fit, held-out perplexity, 
does not reveal this overfitting (though the HDP still outperforms LDA with that metric as well). 
We feel that the predictive distribution is a better metric for model fitness. 

17. We fit distributions using the entire grid of parameters described above. However, to simplify 
presenting results we will hold one of the parameters fixed and vary the other. 
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Figure 10: The 15 most frequent topics from the HDP posterior on the New York 
Times. Each topic plot illustrates the topic's most frequent words. 
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Figure 11: Stochastic inference lets us compare performance on several large data sets. 

We fixed the forgetting rate k = 0.9 and the batch size to 500 documents. 
We find that LDA is sensitive to the number of topics; the HDP gives 
consistently better predictive performance. Traditional variational inference 
(on subsets of each corpus) did not perform as well as stochastic inference. 
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Figure 12: HDP inference: Holding the batch size fixed at 500, we varied the forgetting 
rate k. Slower forgetting rates are preferred. 
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Figure 13: HDP inference: Holding the forgetting rate k fixed at 0.9, we varied the 
batch size. Batch sizes may be set too small (e.g., ten documents) but the 
difference in performance is small once set high enough. 



Fixing the forgetting rate to 0.9, we explored various mini-batch sizes. Figure 13 shows 
the results on all three corpora. Batch sizes that are too small (e.g., ten documents) 
can affect performance; larger batch sizes are preferred. That said, there was not a 
big difference between batch sizes of 500 and 1,000. The New York Times corpus was 
most sensitive to batch size; the Wikipedia corpus was least sensitive. 



Figure 15 illustrate LDA's sensitivity to the forgetting rate and batch size, respectively. 



Again, we find that large learning rates and batch sizes perform well. 



5. Discussion 

We developed stochastic variational inference, a scalable variational inference algorithm 
that lets us analyze very large data sets with complex probabilistic models. The main 
idea is to use stochastic optimization to optimize the variational objective, following 
noisy esimates of the natural gradient where the noise arises by repeatedly subsampling 
the data. We illustrated this approach with two probabilistic topic models, latent 
Dirichlet allocation and the hierarchical Dirichlet process topic model. With stochastic 
variational inference, we can easily apply topic modeling to collections of millions 
documents. More importantly, this algorithm generalizes to many settings. 

Stochastic variational inference opens the door to several promising research directions. 
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Figure 14: 100-topic LDA inference: Holding the batch size fixed at 500, we varied 
the forgetting rate k. Slower forgetting rates are preferred. 
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Figure 15: 100-topic LDA inference: Holding the learning rate k fixed at 0.9, we varied 
the batch size. Bigger batch sizes are preferred. 



We developed our algorithm with conjugate exponential family models. This class of 
models is expressive, but nonconugate models — models where a richer prior is used at 
the expense of mathematical convenience — have expanded the suite of probabilistic 
tools at our disposal. For example, nonconjugate models can capture correlations 



between topics (Blei and Lafferty 2007) or topics changing over time (Blei and Lafferty 



2006, Wang et al. , 2008), and the algorithm presented here cannot be used in these 



settings. (In other work, Paisley et al. (2012b) developed a stochastic variational 
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inference algorithm for a specific nonconjugate Bayesian nonparametric model.) Recent 



research has developed general methods for non-conjugate models (Knowles and Minka 



2011; Gershman et al. , 2012, Paisley et al., 2012a). Can these be scaled up with 



stochastic optimization? 

We developed our algorithm with mean-field variational inference and closed form 
coordinate updates. Another promising direction is to use stochastic optimization to 
scale up recent advances in variational inference, moving beyond closed form updates 
and fully factorized approximate posteriors. As one example, collapsed variational 



inference (Teh et al. 2006b, 2007) marginalizes out some of the hidden variables, 



trading closed-form updates for a more focused posterior. In related work, we developed 
a stochastic variational algorithm for fitting topic models — combining MCMC at a 
local level and variational inference at a global level (Mimno et al. , 2012) — but could 



this be generalized to the full setting considered here? As another example, structured 
variational distributions let us handle more complex posteriors, such as those arising 



in time-series models (Ghahramani and Jordan 1997, Blei and Lafferty 2006). Being 



able to relax the mean-field assumption would expand the repertoire of models that 
we can use on large analysis problems. 

Finally, our algorithm lets us potentially connect innovations in stochastic optimization 



to better methods for approximate posterior inference. For example, Wahabzada and 



Kersting (2011) sample from data non-uniformly to better focus on more informative 



data points. We might also consider data whose distribution changes over time, such 
as when we want to model an infinite stream of data but to "forget" data from the far 
past in a current estimate of the model. Last, we can analyze our estimates of the 
gradient. How is it affected by sparse data? Are there ways to reduce its variance, 
but maintain its unbiasedness? 
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